What do the data look like?

Jupyter IPython notebooks, such as this one, allow you to run both Python code and, using 'magics', shell commands. In this tutorial we'll use both, since we will be interfacing with a variety of software as well as processing data.

First, let's look around in the directory using standard Linux commands. We can execute a shell command by preceding it with an exclamation mark.


In [2]:
!ls -lh ../data/reads


total 107M
-rw-r--r-- 1 root root 24M Oct  8 19:22 mutant1_OIST-2015-03-28.fq.gz
-rw-r--r-- 1 root root 24M Oct  8 19:23 mutant2_OIST-2015-03-28.fq.gz
-rw-r--r-- 1 root root 18M Oct  8 19:23 mutant3_OIST-2015-03-28.fq.gz
-rw-r--r-- 1 root root 24M Oct  8 19:23 mutant4_OIST-2015-03-28.fq.gz
-rw-r--r-- 1 root root 20M Oct  8 19:23 reference_OIST-2015-03-28.fq.gz

We see that there are five files: four of these are mutant samples, and one is the original reference sample. We will take a look inside one of the files and look at the distribution of read statistics.
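
As an aside, IPython's '!' syntax can also capture a command's output into a Python variable, which becomes handy once you want to process several files in a loop. A minimal sketch, assuming the same directory layout as above:

read_files = !ls ../data/reads/*.fq.gz  # capture the listing as a Python list of file paths
print(read_files)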

The reads are in text files that have been compressed using gzip, a common practice for storing raw data. You can look inside by decompressing a file and piping the output to a program called head, which stops after a few lines. You don't want to print the contents of the entire file to the screen, since that will likely crash IPython.


In [3]:
!gunzip -c ../data/reads/mutant1_OIST-2015-03-28.fq.gz | head -8


@M00923:134:000000000-A5ELA:1:2109:24002:5853
ATGCCTATATTGGTTAAAGTATTTAGTGACCTAAGTCAATAAAATTTTAATTTACTCACGGCAGGTAACCAGTTCAGAAGCTGCTATCAGACACTCTTTTTTTAATCCACACAGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCTGA
+
CCCCCGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGCGGFGGGGGGGGGGGGGGGDFFGGGGGGGGF
@M00923:134:000000000-A5ELA:1:2110:26800:16309
CCTATATTGGTTAAAGTATTTAGTGACCTAAGTCAATAAAATTTTAATTTACTCACGGCAGGTAACCAGTTCAGAAGCTGCTATCAGACACTCTTTTTTTAATCCACACCGAGACATATTGCCCGTTGCAGTCAGAATGAAAAGCTGAATA
+
-A-<-CCFDC6,C6C,@FEGGFD,CFEFGC@EFDFD,<,,,,,;E,6C@CFGA,6C,8C:++C<FFC<@F99E@<@,,,@,C96,6696F<E@EF<EF+4,A@A,9=,4,+8+>:<F,??,:EB,@?+@4,CFG;F,=,,4,,,@E=4=,4

gzip: stdout: Broken pipe

Each read in the fastq file format spans four lines: one containing a unique read name, one containing the sequence of bases, one containing a '+' separator, and one containing the quality scores. The quality scores correspond to the sequencer's confidence in each base call.
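
Since every read occupies exactly four lines, a quick way to count the reads in a file is to count lines and divide by four. A minimal sketch, reusing the output-capture syntax shown above:

n_lines = !gunzip -c ../data/reads/mutant1_OIST-2015-03-28.fq.gz | wc -l  # total number of lines
print(int(n_lines[0]) // 4)  # four lines per read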

It is good practice to examine the quality of your data before you proceed with the analysis. We'll use a popular tool called FastQC to do some exploratory analysis.


In [9]:
!fastqc ../data/reads/mutant1_OIST-2015-03-28.fq.gz


Started analysis of mutant1_OIST-2015-03-28.fq.gz
Approx 5% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 10% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 15% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 20% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 25% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 30% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 35% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 40% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 45% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 50% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 55% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 60% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 65% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 70% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 75% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 80% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 85% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 90% complete for mutant1_OIST-2015-03-28.fq.gz
Approx 95% complete for mutant1_OIST-2015-03-28.fq.gz
Analysis complete for mutant1_OIST-2015-03-28.fq.gz
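
By default FastQC writes its report files next to the input file (as the path used in the next cell suggests). A quick listing, assuming that default behaviour, should show the HTML report and zip archive it produced:

!ls -lh ../data/reads/*_fastqc*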

In [10]:
from IPython.display import IFrame
IFrame('../data/reads/mutant1_OIST-2015-03-28_fastqc.html', width=1000, height=1000)


Out[10]:

Key statistics

  • Basic Statistics. Reports the number of sequences and other basic details.
  • Per base sequence quality. The distribution of sequence quality scores over the length of the read.
    • The quality scale is logarithmic. Notice that the quality degrades rapidly over the length of the read. This is a key characteristic of Illumina data, and a product of its sequencing chemistry, which limits the upper read length to about 300 bp.

We can explore the contents of read files programmatically using a Python library called Biopython. This allows us to automate many tedious tasks.


In [6]:
import gzip
from Bio import SeqIO
with gzip.open("../data/reads/mutant1_OIST-2015-03-28.fq.gz", 'rt') as infile: # open and decompress input file
    for rec in SeqIO.parse(infile, "fastq"):  # start looping over all records
        print(rec)  #print record contents
        break # stop looping, we only want to see one record


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-6-1bdbb365e9b1> in <module>()
      1 import gzip
----> 2 from Bio import SeqIO
      3 with gzip.open("../data/reads/mutant1_OIST-2015-03-28.fq.gz", 'rt') as infile: # open and decompress input file
      4     for rec in SeqIO.parse(infile, "fastq"):  # start looping over all records
      5         print(rec)  #print record contents

ImportError: No module named 'Bio'

If you see an ImportError like this one, Biopython is not installed in your environment; it can usually be installed with pip install biopython (or via conda). The cells below assume the import succeeded, so that rec holds the first record from the file.

You can see the methods and attributes associated with each object, such as rec, using the built-in dir function.


In [6]:
print(dir(rec))  # print the methods and attributes associated with rec


['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'upper']

For example, we can reverse complement the sequence:


In [7]:
rec.reverse_complement()


Out[7]:
SeqRecord(seq=Seq('TCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAA...CAT', SingleLetterAlphabet()), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])

There are lots of other interesting functions to explore!
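
For example, one common tedious task is summarising read lengths across a whole file. A minimal sketch using the same SeqIO parser as above (the cap at the first 1,000 records is arbitrary, just to keep it quick):

import gzip
from Bio import SeqIO

lengths = []
with gzip.open("../data/reads/mutant1_OIST-2015-03-28.fq.gz", 'rt') as infile:
    for i, record in enumerate(SeqIO.parse(infile, "fastq")):
        lengths.append(len(record.seq))  # length of each read
        if i >= 999:  # stop after the first 1,000 reads
            break
print("min:", min(lengths), "mean:", round(sum(lengths) / len(lengths), 1), "max:", max(lengths))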

Exercises and questions

Exercises should be done in Python or bash.

  1. Write a for loop to run FastQC on all the samples, and examine their output.
  2. If you look at the "Per base sequence quality" plot in FastQC, you'll see that the quality decreases along the length of the read. Why does that happen?
  3. What does a score of 20 correspond to? (Hint: these are called phred scores.)
  4. Look at the quality scores associated with the first read in reads/mutant1_OIST-2015-03-28.fq.gz, named M00923:134:000000000-A5ELA:1:2109:24002:5853. What is the average error rate? How many errors can we expect per read?
  5. Check out the "Sequence Duplication Levels" report. Why would there be duplicated sequences?